Tortoise is an automated solution designed to meet Kubernetes resource optimization needs. It shifts optimization responsibility from service owners to platform teams: service owners configure only a minimal set of parameters to initiate autoscaling, while platform teams handle comprehensive tuning. Tortoise currently supports only Deployments; support for all resources that implement the scale subresource is in development.
Friday, March 22, 2024
GPU provider Lambda has secured a special $500m debt financing deal to expand its GPU cloud offering, in addition to the $230m Series C it raised earlier this year.
AutoMQ is a cloud-native implementation of Apache Kafka. It’s serverless, works with any cloud provider, and can cut your cloud infrastructure bill by up to 90%.
Modal's HTTP and WebSocket stack for serverless functions allows for seamless web endpoint deployment and real-time communication capabilities. To overcome the limits of traditional serverless platforms, Modal designed a system that can handle demanding workloads, offering extensive CPU, memory, and GPU resources. Modal treats HTTP/WebSocket requests like function calls and leverages Rust and ASGI (Asynchronous Server Gateway Interface) for efficient processing.
Monday, March 18, 2024
A Lambda cold start refers to the initial delay experienced when an AWS Lambda function is invoked after being idle. The duration of a cold start ranges from under 100 ms to over one second, with the over-one-second case occurring in about 1% of all invocations. This article goes into detail on ways to optimize cold start latency, such as fine-tuning your function's memory, using the ARM64 CPU architecture, optimizing function creation methods, and more.
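One classic optimization in this space is moving expensive setup work out of the handler and into module scope, so it runs once during the cold start rather than on every invocation. A minimal sketch (the "client" here is a hypothetical stand-in, not a real AWS SDK call):

```python
import time

_INIT_COUNT = 0  # counts how many times the "expensive" setup ran

def _create_client():
    # Hypothetical stand-in for expensive init work, such as building an
    # SDK client or loading configuration.
    global _INIT_COUNT
    _INIT_COUNT += 1
    return {"created_at": time.time()}

# Module scope: executed during the Lambda init phase (the cold start),
# not on every invocation.
CLIENT = _create_client()

def handler(event, context):
    # Warm invocations reuse CLIENT instead of rebuilding it.
    return {"initializations": _INIT_COUNT}
```

Calling `handler` repeatedly in the same execution environment leaves the initialization count at one; only a fresh cold start pays the setup cost again.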
This author switched a side project to a Kubernetes-based infrastructure, only to find it overly complex, expensive, and difficult to manage. Despite the promise of high availability, the system suffered from slow performance, difficult debugging, and downtime during node failures. While Kubernetes can be powerful, it's important to choose the right tools for the job and not get caught up in complexity for its own sake if it's not necessary.
Allegro, a major e-commerce platform, successfully transformed its monolithic application into a microservices architecture. Along the way, it shifted away from physical servers to a cloud-based infrastructure, which led to a much better developer experience. It also made large investments in tooling and infrastructure to reduce manual work for its engineers. Allegro made sure to allow room for experimentation, but also kept teams accountable for their architectural decisions.
Ubicloud successfully enabled ARM64 VMs with only 151 lines of code by standardizing CPU architecture names, automating architecture detection, and adjusting VM placement logic. It had to deal with challenges like inflexible hardware configurations from its bare metal provider Hetzner.
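The architecture-name standardization step can be sketched as a small normalization table plus a placement check. This is an illustrative assumption about the shape of such code, not Ubicloud's actual implementation:

```python
# Map the various aliases reported by tools like `uname -m` onto canonical
# names used for VM placement. The alias table is illustrative.
ARCH_ALIASES = {
    "x86_64": "x64", "amd64": "x64", "x64": "x64",
    "aarch64": "arm64", "arm64": "arm64",
}

def normalize_arch(raw: str) -> str:
    try:
        return ARCH_ALIASES[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown CPU architecture: {raw!r}")

def can_place(vm_arch: str, host_arch: str) -> bool:
    # Placement logic: a VM may only land on a host of the same architecture.
    return normalize_arch(vm_arch) == normalize_arch(host_arch)
```

Centralizing the aliases in one table is what keeps a change like this small: detection, placement, and provisioning code all compare canonical names instead of each handling raw strings.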
An MVP of serverless Postgres using Oriole, Fly Machines, and Tigris for S3-compatible storage.
Checkly, a synthetic monitoring tool, reduced its pod startup time by optimizing its AWS SDK usage and reordering startup operations. It initially used large containers with long startup times, but its switch to ephemeral pods demanded faster startup. The company found that loading multiple versions of the AWS SDK was causing delays; standardizing on a single version yielded significant compute cost savings.
Using LLMs for root cause analysis (RCA) for cloud incidents is probably not a great idea. LLMs, while capable of automating some aspects of RCA, lack the depth and nuance of human experts. Furthermore, there's the "automation surprise" problem, where unexpected LLM behavior could lead to dangerous situations due to user misunderstanding.
Amazon S3 now supports conditional writes that can check for the existence of an object before creating it. This helps developers more easily prevent applications from overwriting any existing objects when uploading data. Conditional writes can be used to simplify how distributed applications with multiple clients concurrently update data in parallel across shared datasets. Developers no longer need to build any client-side consensus mechanisms to coordinate updates or use additional API requests to check for the presence of an object before uploading data. The feature is available at no additional charge in all AWS regions.
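The feature maps to the standard `If-None-Match: *` HTTP precondition: the write succeeds only if no object exists at that key, otherwise S3 returns a 412. A minimal in-memory sketch of those semantics (a stand-in bucket, not the AWS SDK):

```python
class PreconditionFailed(Exception):
    """Stand-in for S3's 412 response when the object already exists."""

class FakeBucket:
    # In-memory stand-in illustrating `If-None-Match: *` conditional writes.
    def __init__(self):
        self._objects = {}

    def put_object(self, key, body, if_none_match=None):
        # With `If-None-Match: *`, the write is rejected if the key exists.
        if if_none_match == "*" and key in self._objects:
            raise PreconditionFailed(key)
        self._objects[key] = body

def create_once(bucket, key, body):
    """Return True if this writer created the object, False if it lost the race."""
    try:
        bucket.put_object(key, body, if_none_match="*")
        return True
    except PreconditionFailed:
        return False
```

With real S3, many clients can race to create the same key and exactly one succeeds, which is what removes the need for client-side consensus or a preliminary existence check.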
Cloudflare has introduced a significant enhancement to its Durable Objects (DO) by integrating zero-latency SQLite storage, fundamentally changing how applications can manage data in the cloud. Traditional cloud storage often suffers from latency due to network access and the need for synchronization across multiple clients. With Durable Objects, however, application code runs directly where the data is stored, eliminating the need for context switching and allowing for near-instantaneous access to data.

Previously, Durable Objects provided only key/value storage, but the new integration with SQLite allows for a full SQL query interface, complete with tables and indexes. SQLite is widely recognized for its speed and reliability, making it an ideal choice for this new architecture. By embedding SQLite directly within the Durable Objects, Cloudflare enables applications to execute SQL queries with minimal latency, often completing in microseconds.

Durable Objects are part of the Cloudflare Workers serverless platform, functioning as small servers that maintain state both in-memory and on-disk. Each DO can be uniquely addressed, allowing for global access and coordination of operations. This architecture is particularly beneficial for applications requiring real-time collaboration, such as document editing, where multiple users can interact with the same data seamlessly. The design of Durable Objects emphasizes scalability by encouraging the creation of multiple objects to handle increased traffic rather than relying on a single object, allowing for efficient management of state and traffic distribution across the network.

One of the standout features of the new SQLite integration is the synchronous nature of database queries. Unlike traditional asynchronous database calls, which can introduce complexity and potential bugs, the synchronous queries in Durable Objects ensure that the application state remains consistent and predictable. This design choice simplifies coding and enhances performance, as the application can execute queries without waiting for I/O operations to complete.

To address concerns about write durability, Cloudflare has implemented a mechanism called "Output Gates." This system allows applications to continue processing without waiting for write confirmations, while still ensuring that responses to clients are only sent after confirming that writes have been successfully stored. This dual approach maintains high throughput and low latency.

The integration also simplifies common database issues, such as the "N+1 selects" problem, by allowing developers to write straightforward queries without needing to optimize for performance intricacies. Additionally, SQLite-backed Durable Objects offer point-in-time recovery, enabling users to revert to any state within the last 30 days, providing a safety net against data corruption. Developers can easily implement SQLite-backed Durable Objects by defining their classes and migrations in the Cloudflare environment, and the pricing model for this new feature aligns with existing Cloudflare services, offering a competitive structure for SQL queries and storage.

In contrast to Cloudflare's D1 product, which is a more managed database solution, SQLite-in-Durable-Objects provides a lower-level building block for developers who want more control over their applications. D1 operates within a traditional cloud architecture, while SQLite-in-DO allows for colocated application logic and data storage, offering unique advantages for specific use cases. The underlying technology for this new feature is the Storage Relay Service (SRS), which efficiently manages data persistence by combining local disk speed with the durability of object storage. SRS records changes in a log format and utilizes a network of follower machines to ensure data integrity and availability.
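The "Output Gates" idea described above can be sketched roughly as follows. The names and structure here are illustrative, not Cloudflare's implementation: writes are recorded without blocking the caller, and the response is released only after those writes are confirmed durable.

```python
class OutputGate:
    # Sketch of the output-gate pattern: keep computing without awaiting
    # each write, but gate outgoing responses on durability confirmation.
    def __init__(self):
        self._pending = []

    def write(self, op):
        # Record the write; don't block the caller waiting for durability.
        self._pending.append(op)

    def flush(self, storage):
        # Confirm durability for everything written so far.
        for op in self._pending:
            storage.append(op)
        self._pending.clear()

    def respond(self, storage, body):
        # Release the response only after the writes it depends on
        # have been flushed to durable storage.
        self.flush(storage)
        return body
```

The caller's code stays straight-line and synchronous; only the response path pays the durability wait, which is how throughput stays high without risking a client acting on an unconfirmed write.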
Overall, the introduction of zero-latency SQLite storage in Durable Objects represents a significant advancement in cloud computing, enabling developers to build faster, more reliable applications with enhanced data management capabilities.
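As a rough illustration of the query style involved, here is plain Python `sqlite3`, an in-process synchronous database used as a stand-in for the Durable Objects SQL interface (the schema and data are invented): because queries are local and synchronous, one JOIN replaces the per-row queries of an N+1 pattern.

```python
import sqlite3

# In-process SQLite as a stand-in for SQLite-in-Durable-Objects:
# queries run synchronously against local data, so a single JOIN
# replaces one SELECT per parent row. Schema and rows are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE edits (doc_id INTEGER REFERENCES docs(id), author TEXT);
    INSERT INTO docs VALUES (1, 'roadmap'), (2, 'notes');
    INSERT INTO edits VALUES (1, 'alice'), (1, 'bob'), (2, 'carol');
""")

# One synchronous JOIN instead of one SELECT per document.
rows = db.execute("""
    SELECT docs.title, COUNT(edits.author) AS edit_count
    FROM docs LEFT JOIN edits ON edits.doc_id = docs.id
    GROUP BY docs.id ORDER BY docs.id
""").fetchall()
# rows == [('roadmap', 2), ('notes', 1)]
```

When the network round trip per query disappears, the usual pressure to batch or restructure queries for latency largely disappears with it, which is the point the article makes about not needing to optimize around performance intricacies.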